Abstract

We establish the convergence of a class of Metropolis-type Markov chain annealing
algorithms for global optimization of a smooth function $U(\cdot)$ on $\mathbb{R}^d$. No prior
information is assumed as to what bounded region contains a global minimum. Our analysis is
based on writing the Metropolis-type algorithm in the form of a recursive stochastic
algorithm $X_{k+1} = X_k - a_k(\nabla U(X_k) + \xi_k) + b_k W_k$, where $\{W_k\}$ are independent
standard Gaussian random variables, $\{\xi_k\}$ are (unbounded, correlated) random variables,
and $a_k = A/k$, $b_k = \sqrt{B}/\sqrt{k \log\log k}$ for $k$ large, and then applying results about
$\{X_k\}$ from [15]. Since the analysis of $\{X_k\}$ in [15] is based on the asymptotic behavior
of the related Langevin-type Markov diffusion annealing algorithm
$dY(t) = -\nabla U(Y(t))\,dt + c(t)\,dW(t)$, where $W(\cdot)$ is a standard Wiener process and
$c(t) = \sqrt{C}/\sqrt{\log t}$ for $t$ large, this work demonstrates and exploits the close
relationship between the Markov chain and diffusion versions of simulated annealing.

1. INTRODUCTION

Let $U(\cdot)$ be a real-valued function on some set $\Sigma$. The global optimization problem
is to find an element of the set $S^* = \{x : U(x) \le U(y) \ \forall\, y \in \Sigma\}$ (assuming $S^* \ne \emptyset$).
Recently, there has been a lot of interest in the simulated annealing method for global
optimization. Annealing algorithms were initially proposed for finite optimization ($\Sigma$
finite), and later developed for continuous optimization ($\Sigma = \mathbb{R}^d$). An annealing
algorithm for finite optimization was first suggested in [1], [2], and is based on simulating a
finite-state Metropolis-type Markov chain. The Metropolis algorithm and other related
algorithms, such as the "heat bath" algorithm, were originally developed as Markov
chain sampling methods for sampling from a Gibbs distribution [3]. The asymptotic
behavior of finite-state Metropolis-type annealing algorithms has been extensively
analyzed [4]-[9].

A continuous time annealing algorithm for continuous optimization was first
suggested in [10], [11] and is based on simulating a Langevin-type Markov diffusion:

$$dY(t) = -\nabla U(Y(t))\,dt + c(t)\,dW(t). \tag{1.1}$$

Here $U(\cdot)$ is a smooth function on $\mathbb{R}^d$, $W(\cdot)$ is a standard $d$-dimensional Wiener process,
and $c(\cdot)$ is a positive function with $c(t) \to 0$ as $t \to \infty$. In the terminology of simulated
annealing algorithms, $U(x)$ is called the energy of state $x$, and $T(t) = c^2(t)/2$ is called
the temperature at time $t$. Note that for a fixed temperature $T(t) = T$, the resulting
Langevin diffusion, like the Metropolis chain, has a Gibbs distribution $\propto \exp(-U(x)/T)$ as
its unique invariant distribution. Now (1.1) arises by adding slowly decreasing white
Gaussian noise to the continuous time gradient algorithm

$$\dot{z}(t) = -\nabla U(z(t)). \tag{1.2}$$

The idea behind using (1.1) instead of (1.2) for minimizing $U(\cdot)$ is to avoid getting
trapped in strictly local minima. The asymptotic behavior of $Y(t)$ as $t \to \infty$ has been
studied in [10], [12]-[14]. In [10], [14] convergence results were obtained for a version of
(1.1) which was modified to constrain the trajectories to lie in a fixed bounded set (and
hence is only applicable to global optimization over a compact subset of $\mathbb{R}^d$); in [12],
[13] results were obtained for global optimization over all of $\mathbb{R}^d$. Chiang, Hwang and
Sheu's main result from [12] can be roughly stated as follows: if $U(\cdot)$ is suitably
behaved and $c^2(t) = C/\log t$ for $t$ large with $C > C_0$ (a constant depending only on
$U(\cdot)$), then $Y(t) \to S^*$ as $t \to \infty$ in probability.

A discrete time annealing algorithm for continuous optimization was suggested in
[14], [15] and is based on simulating a recursive stochastic algorithm

$$X_{k+1} = X_k - a_k(\nabla U(X_k) + \xi_k) + b_k W_k. \tag{1.3}$$

Here $U(\cdot)$ is again a smooth function on $\mathbb{R}^d$, $\{\xi_k\}$ is a sequence of $\mathbb{R}^d$-valued random
variables, $\{W_k\}$ is a sequence of independent standard $d$-dimensional Gaussian random
variables, and $\{a_k\}$, $\{b_k\}$ are sequences of positive numbers with $a_k, b_k \to 0$ as $k \to \infty$.
The algorithm (1.3) could arise from a discretization or numerical integration of the
diffusion (1.1) so as to be suitable for implementation on a digital computer; in this case
$\xi_k$ is due to the discretization error. Alternatively, the algorithm (1.3) could arise by
artificially adding slowly decreasing white Gaussian noise (i.e., the $b_k W_k$ terms) to a
stochastic gradient algorithm

$$Z_{k+1} = Z_k - a_k(\nabla U(Z_k) + \xi_k) \tag{1.4}$$

which arises in a variety of optimization problems including adaptive filtering,
identification and control; in this case $\xi_k$ is due to noisy or imprecise measurements of
$\nabla U(\cdot)$ (cf. [16]). The idea behind using (1.3) instead of (1.4) for minimizing $U(\cdot)$ is to
avoid getting trapped in strictly local minima. In the sequel we will refer to (1.4) and
(1.3) as standard and modified stochastic gradient algorithms, respectively. The
asymptotic behavior of $X_k$ as $k \to \infty$ has been studied in [14], [15]. In [14] convergence
results were obtained for a version of (1.3) which was modified to constrain the
trajectories to lie in a compact set (and hence is only applicable to global optimization over a
compact subset of $\mathbb{R}^d$); in [15] results were obtained for global optimization over all of
$\mathbb{R}^d$. Also, in [14] convergence is obtained essentially only for the case where $\xi_k = 0$; in
[15] convergence is obtained for $\{\xi_k\}$ with unbounded variance. This latter fact has
important implications when $\nabla U(\cdot)$ is not measured exactly. Our main result from [15]
can be roughly stated as follows: if $U(\cdot)$ and $\{\xi_k\}$ are suitably behaved, $a_k = A/k$ and
$b_k^2 = B/(k \log\log k)$ for $k$ large with $B/A > C_0$ (the same $C_0$ as above), and $\{X_k\}$ is tight,
then $X_k \to S^*$ as $k \to \infty$ in probability (conditions are also given in [15] for tightness of
$\{X_k\}$). Our analysis in [15] of the asymptotic behavior of $X_k$ as $k \to \infty$ is based on the
asymptotic behavior of the associated SDE (1.1). This is analogous to the well-known
method of analyzing the asymptotic behavior of $Z_k$ as $k \to \infty$ based on the asymptotic
behavior of the associated ODE (1.2) [16], [17].
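
The recursion (1.3) can be sketched numerically as follows. This is only a minimal illustration under stated assumptions, not the implementation analyzed in [15]: it takes $\xi_k = 0$ (exact gradients), uses the gain sequences $a_k = A/k$, $b_k^2 = B/(k \log\log k)$ from the first index onward, and uses a quadratic energy $U(x) = |x|^2/2$ chosen purely for illustration. The names `modified_sgd`, `grad_quadratic` and the starting index `k0` are hypothetical.

```python
import math
import random


def modified_sgd(grad_u, x0, A=1.0, B=1.0, k0=16, n_steps=2000, seed=0):
    """Iterate X_{k+1} = X_k - a_k * grad U(X_k) + b_k * W_k, i.e. (1.3) with xi_k = 0."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(k0, k0 + n_steps):
        a_k = A / k                                        # gradient gain
        b_k = math.sqrt(B / (k * math.log(math.log(k))))   # noise gain (needs k >= 3)
        g = grad_u(x)
        # Gradient step plus an independent standard Gaussian perturbation per coordinate.
        x = [xi - a_k * gi + b_k * rng.gauss(0.0, 1.0) for xi, gi in zip(x, g)]
    return x


def grad_quadratic(x):
    # Gradient of the illustrative energy U(x) = |x|^2 / 2 (an assumption, not from the paper).
    return list(x)


final = modified_sgd(grad_quadratic, [3.0, -2.0])
```

Setting `B = 0` in this sketch removes the added noise and gives the standard stochastic gradient algorithm (1.4) with $\xi_k = 0$.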
It has also been suggested that continuous (global) optimization might be
performed by simulating a continuous-state Metropolis-type Markov chain [10], [18], [19].
Although some numerical work has been performed with continuous-state Metropolis-type
annealing algorithms, there has been very little theoretical analysis, and furthermore
the analysis of the continuous-state case does not follow from the finite-state case
in a straightforward way (especially for an unbounded state space). The only analysis
we are aware of is in [19], where a certain asymptotic stability property is established for
a related algorithm and a particular cost function which arises in a problem of image
restoration.

In this paper we demonstrate the convergence of a class of continuous-state
Metropolis-type Markov chain annealing algorithms for general cost functions. Our
approach is to write such an algorithm in (essentially) the form of a modified stochastic
gradient algorithm (1.3) for a suitable choice of $\{\xi_k\}$, and to apply results from [15]. A
convergence result is obtained for global optimization over all of $\mathbb{R}^d$. Some care is
necessary to formulate a Metropolis-type Markov chain with appropriate scaling. It turns
out that writing the Metropolis-type annealing algorithm in the form (1.3) is rather
more complicated than writing standard variations of gradient algorithms, which use
some type of (possibly noisy) finite difference estimate of $\nabla U(\cdot)$, in the form (1.4) (cf.
[16]). Indeed, to the extent that the Metropolis-type annealing algorithm uses an
estimate of $\nabla U(\cdot)$, it does so in a much more subtle manner than a finite difference
approximation, as will be seen in the analysis.

Since our convergence results for the Metropolis-type Markov chain annealing
algorithm are ultimately based on the asymptotic behavior of the Langevin-type Markov
diffusion annealing algorithm, this paper demonstrates and exploits the close
relationship between the Markov chain and diffusion versions of simulated annealing, which is
particularly interesting in view of the fact that the development and analysis of these
methods has proceeded more or less independently. We remark that similar
convergence results for annealing algorithms based on other continuous-state Markov chain
sampling methods (such as the "heat bath" method) can be obtained by a procedure similar to
that used in this paper.

The paper is organized as follows. In Section 2 we discuss appropriately modified
versions of the tightness and convergence results for modified stochastic gradient
algorithms as given in [15]. In Section 3 we present a class of continuous-state
Metropolis-type annealing algorithms and state some convergence theorems. In Section
4 we prove the convergence theorems of Section 3 using the results of Section 2.


2.  MODIFIED  STOCHASTIC  GRADIENT  ALGORITHMS 

In this Section we give convergence and tightness results for modified stochastic
gradient algorithms of essentially the type described in Section 1. The algorithms and
theorems discussed below are a slight variation on the results of [15], and are
appropriate for proving convergence and tightness for a class of continuous-state Metropolis-type
annealing algorithms (see Sections 3 and 4).

We use the following notation throughout the paper. Let $\nabla U(\cdot)$, $\Delta U(\cdot)$, and
$HU(\cdot)$ denote the gradient, Laplacian and Hessian matrix of $U(\cdot)$, respectively. Let $|\cdot|$,
$\langle \cdot, \cdot \rangle$ and $\otimes$ denote Euclidean norm, inner product, and outer product, respectively.
For real numbers $a$ and $b$ let $a \vee b = \max\{a,b\}$, $a \wedge b = \min\{a,b\}$,
$[a]_+ = a \vee 0$, and $[a]_- = a \wedge 0$. For a process $\{X_k\}$ and a function $f(\cdot)$, let
$E_{n,x}\{f(X_k)\}$, $P_{n,x}\{f(X_k)\}$ denote conditional expectation and probability given $X_n = x$
(more precisely, these are suitable fixed versions of the conditional expectation and
probability). Also for a measure $\mu(\cdot)$ and a function $f(\cdot)$ let $\mu(f) = \int f\,d\mu$. Finally, let
$N(m,R)(\cdot)$ denote normal measure with mean $m$ and covariance matrix $R$, and let $I$
denote the identity matrix.

2.1.  Convergence 

In this subsection we consider the convergence of the discrete time algorithm†

$$X_{k+1} = X_k - a_k(\nabla U(X_k) + \xi_k) + b_k(|X_k| \vee 1)W_k. \tag{2.1}$$

Here $U(\cdot)$ is a smooth real-valued function on $\mathbb{R}^d$, $\{\xi_k\}$ is a sequence of $\mathbb{R}^d$-valued
random variables, $\{W_k\}$ is a sequence of independent standard $d$-dimensional Gaussian
random variables, and

$$a_k = \frac{A}{k}, \qquad b_k = \frac{\sqrt{B}}{\sqrt{k \log\log k}}, \qquad k \text{ large},$$

where $A$, $B$ are positive constants.

† The results are not changed if we replace $|X_k| \vee 1$ by $|X_k| \vee a$ or $|X_k| + a$ for $a \ge 1$.

For $k = 0, 1, \ldots$ let $\mathcal{F}_k = \sigma(X_0, W_0, \ldots, W_{k-1}, \xi_0, \ldots, \xi_{k-1})$. In the sequel we will
consider the following conditions ($\alpha$, $\beta$ are constants whose values will be specified later).

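(A1) $U(\cdot)$ is a function from $\mathbb{R}^d$ to $[0,\infty)$ such that

$$\liminf_{|x| \to \infty} \frac{|\nabla U(x)|}{|x|} > 0,$$

$$\liminf_{|x| \to \infty} \left\langle \frac{\nabla U(x)}{|\nabla U(x)|}, \frac{x}{|x|} \right\rangle \quad \text{[lower bound illegible in source]},$$

$$\inf_x \left( |\nabla U(x)|^2 - \Delta U(x) \right) > -\infty.$$

(A2) For $\epsilon > 0$ let

$$d\pi^\epsilon(x) = \frac{1}{Z^\epsilon} \exp\left( -\frac{2U(x)}{\epsilon^2} \right) dx, \qquad Z^\epsilon = \int \exp\left( -\frac{2U(x)}{\epsilon^2} \right) dx < \infty.$$

$\pi^\epsilon$ has a weak limit $\pi$ as $\epsilon \to 0$.

(A3) Let $K$ be a compact subset of $\mathbb{R}^d$. Then there exists $L > 0$ such that

$$E\{|\xi_k|^2 \mid \mathcal{F}_k\} \le L a_k^{\alpha}, \quad \forall\, X_k \in K, \ \text{w.p.1},$$

$$|E\{\xi_k \mid \mathcal{F}_k\}| \le L a_k^{\beta}, \quad \forall\, X_k \in K, \ \text{w.p.1}.$$

$W_k$ is independent of $\mathcal{F}_k$.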

We note that $\pi$ concentrates on $S^*$, the global minima of $U(\cdot)$. For example, if $S^*$
consists of a finite number of points, then $\pi$ exists and is uniformly distributed over $S^*$.
The existence of $\pi$ and a simple characterization in terms of $HU(\cdot)$ is discussed in [20].
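
The concentration of $\pi^\epsilon$ on the global minima as $\epsilon \to 0$ can be seen in a small numerical sketch. The code below is an illustration only and makes several assumptions that are not from the paper: it takes the Gibbs density proportional to $\exp(-2U(x)/\epsilon^2)$, uses a one-dimensional tilted double-well energy whose global minimum lies in the left well, and approximates the integrals by a Riemann sum on $[-3, 3]$. The names `U` and `left_mass` are hypothetical.

```python
import math


def U(x):
    # Illustrative nonnegative double well; the tilt 0.2*x makes the left well the global minimum.
    return (x * x - 1.0) ** 2 + 0.2 * x + 1.0


def left_mass(eps, lo=-3.0, hi=3.0, n=6001):
    """Approximate pi^eps({x < 0}) for the density proportional to exp(-2 U(x) / eps^2)."""
    h = (hi - lo) / (n - 1)
    xs = [lo + i * h for i in range(n)]
    u_min = min(U(x) for x in xs)
    # Subtracting u_min rescales numerator and denominator equally and avoids underflow.
    w = [math.exp(-2.0 * (U(x) - u_min) / eps ** 2) for x in xs]
    return sum(wi for x, wi in zip(xs, w) if x < 0.0) / sum(w)


# Mass of the well containing the global minimum for decreasing eps.
masses = [left_mass(eps) for eps in (1.0, 0.5, 0.25)]
```

In this example the mass assigned to the left well increases toward one as `eps` decreases, which is the qualitative behavior that (A2) formalizes through the weak limit $\pi$.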

In [12] and [15] it was shown that there exists a constant $C_0$ which plays a critical
role in the convergence of (1.1) and (1.3), respectively (in [12] $C_0$ was denoted by $c_0$).
$C_0$ has an interpretation in terms of the action functional for the dynamical system (1.2);
see [12] for an explicit expression for $C_0$ and some examples. The constant $C_0$ plays the
same role in the convergence of (2.1) considered here.

Let $K_1 \subset \mathbb{R}^d$ and let $\{X_k^x\}$ denote the solution of (2.1) with $X_0 = x$. We shall say
that $\{X_k^x : k \ge 0, x \in K_1\}$ is tight if given $\epsilon > 0$ there exists a compact $K_2 \subset \mathbb{R}^d$ such
that $P_{0,x}\{X_k \in K_2\} > 1 - \epsilon$ for all $k \ge 0$ and $x \in K_1$. Here is our theorem on the
convergence of $X_k$ as $k \to \infty$.


Theorem 1: Assume (A1), (A2), (A3) hold with $\alpha > -1$ and $\beta > 0$. Let $\{X_k\}$ be given
by (2.1), and assume $\{X_k^x : k \ge 0, x \in K\}$ is tight for $K$ a compact set. Then for
$B/A > C_0$ and any bounded continuous function $f(\cdot)$ on $\mathbb{R}^d$

$$\lim_{k \to \infty} E_{0,x}\{f(X_k)\} = \pi(f)$$

uniformly for $x$ in a compact set.

Note that since $\pi$ concentrates on $S^*$, under the conditions of Theorem 1 we have
$X_k \to S^*$ as $k \to \infty$ in probability.

Theorem 1 is proved similarly to [15, Theorem 2], where we considered the
algorithm

$$X_{k+1} = X_k - a_k(\nabla U(X_k) + \xi_k) + b_k W_k, \tag{2.2}$$

and we will not go through the details here. The main difference between the
conditions and proofs of Theorem 1 and [15, Theorem 2] is that in Theorem 1 the condition
$\liminf_{|x| \to \infty} |\nabla U(x)|/|x| > 0$ is needed to establish the tightness of $\{Y(t)\}$ for the diffusion
$dY(t) = -\nabla U(Y(t))\,dt + c(t)(|Y(t)| \vee 1)\,dW(t)$ associated with (2.1), whereas in [15,
Theorem 2] the weaker condition $\lim_{|x| \to \infty} |\nabla U(x)| = \infty$ suffices to establish the tightness
of $\{Y(t)\}$ for the diffusion $dY(t) = -\nabla U(Y(t))\,dt + c(t)\,dW(t)$ associated with (2.2).


2.2.  Tightness 

In this subsection we consider the tightness of the discrete time algorithm†

$$X_{k+1} = X_k - a_k(\psi_k(X_k) + \eta_k) + b_k(|X_k| \vee 1)W_k. \tag{2.3}$$

Here $\{\psi_k(\cdot)\}$ are Borel functions from $\mathbb{R}^d$ to $\mathbb{R}^d$, $\{\eta_k\}$ is a sequence of $\mathbb{R}^d$-valued
random variables, and $\{W_k\}$, $\{a_k\}$, $\{b_k\}$ are as in Section 2.1. Below we give sufficient
conditions for the tightness of $\{X_k^x : k \ge 0, x \in K\}$ where $K$ is a compact subset of $\mathbb{R}^d$.
Note that algorithm (2.3) is somewhat more general than algorithm (2.1). The reason
for considering this more general algorithm is that it is sometimes convenient to write
an algorithm in the form (2.3) (with $\psi_k(x) \ne \nabla U(x)$ for some $x$, $k$) to verify tightness,
and then to write the algorithm in the form (2.1) to verify convergence. We will give an
example of this situation when we consider continuous-state Metropolis-type annealing
algorithms in Sections 3 and 4.

† The results are not changed if we replace $|X_k| \vee 1$ by $|X_k| \vee a$ or $|X_k| + a$ for $a > 0$.

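Let $\mathcal{G}_k = \sigma(X_0, W_0, \ldots, W_{k-1}, \eta_0, \ldots, \eta_{k-1})$. In the sequel we will consider the
following conditions ($\alpha$, $\beta$, $\gamma_1$, $\gamma_2$ are constants whose values will be specified later).

(B1) Let $K$ be a compact subset of $\mathbb{R}^d$. Then

$$\sup_{k,\, x \in K} |\psi_k(x)| < \infty,$$

$$\limsup_{k,\, |x| \to \infty} \frac{|\psi_k(x)|}{|x|}\, a_k^{\gamma_1} < \infty,$$

$$\liminf_{k,\, |x| \to \infty} \frac{|\psi_k(x)|}{|x|}\, a_k^{\gamma_2} > 0,$$

$$\liminf_{k,\, |x| \to \infty} \left\langle \frac{\psi_k(x)}{|\psi_k(x)|}, \frac{x}{|x|} \right\rangle > 0.$$

(B2) There exists $L > 0$ such that

$$E\{|\eta_k|^2 \mid \mathcal{G}_k\} \le L a_k^{\alpha}(|X_k|^2 \vee 1) \quad \text{w.p.1},$$

$$|E\{\eta_k \mid \mathcal{G}_k\}| \le L a_k^{\beta}(|X_k| \vee 1) \quad \text{w.p.1}.$$

$W_k$ is independent of $\mathcal{G}_k$.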


Theorem 2: Assume that (B1), (B2) hold with $\alpha > -1$, $\beta > 0$, and
$0 \le \gamma_2 \le \gamma_1 < 1/2$. Let $\{X_k\}$ be given by (2.3) and $K$ be a compact subset of $\mathbb{R}^d$.
Then $\{X_k^x : k \ge 0, x \in K\}$ is a tight family of random variables.

Theorem 2 is proved similarly to [15, Theorem 3], where we considered the
algorithm

$$X_{k+1} = X_k - a_k(\psi_k(X_k) + \eta_k) + b_k W_k \tag{2.4}$$

and we will not go through the details here. The main difference between the
conditions and proofs of Theorem 2 and [15, Theorem 3] is that in [15, Theorem 3] we
allowed $\{\psi_k(x) : x \in \mathbb{R}^d\}$ to be a random vector field but we did not allow the bounds in
(B2) to depend on $|x|$.


3.  METROPOLIS-TYPE  ANNEALING  ALGORITHMS 


In this Section we review the finite-state Metropolis-type Markov chain annealing
algorithm, generalize it to an arbitrary state space, and then specialize it to a class of
algorithms for which the results in Section 2 can be applied to establish convergence.

The finite-state Metropolis-type annealing algorithm may be described as follows
[5]. Assume that the state space $\Sigma$ is a finite set. Let $U(\cdot)$ be a real-valued function on $\Sigma$
(the "energy" function) and $\{T_k\}$ be a sequence of strictly positive numbers (the
"temperature" sequence). Let $q(i,j)$ be a stationary transition probability from $i$ to $j$, for
$i,j \in \Sigma$. The one-step transition probability at time $k$ for the finite-state Metropolis-type
annealing chain $\{X_k\}$ is given by

$$P\{X_{k+1} = j \mid X_k = i\} = q(i,j)\,s_k(i,j), \quad j \ne i,$$
$$P\{X_{k+1} = i \mid X_k = i\} = 1 - \sum_{j \ne i} q(i,j)\,s_k(i,j), \tag{3.1}$$

where

$$s_k(i,j) = \exp\left( -\frac{[U(j) - U(i)]_+}{T_k} \right). \tag{3.2}$$

This nonstationary Markov chain may be interpreted (and simulated) in the following
manner. Given the current state $X_k = i$, generate a candidate state $\tilde{X}_k = j$ with
probability $q(i,j)$. Set the next state $X_{k+1} = j$ if $s_k(i,j) > \theta_k$, where $\theta_k$ is an independent
random variable uniformly distributed on the interval $[0,1]$; otherwise set $X_{k+1} = i$.
Suppose that the stochastic matrix $Q = [q(i,j)]$ is symmetric and irreducible, and the
temperature $T_k$ is fixed at a constant $T > 0$. Then it can be shown that the resulting
stationary Markov chain has a unique invariant Gibbs distribution with mass $\propto
\exp(-U(i)/T)$, and furthermore converges to this Gibbs distribution as $k \to \infty$ [21].
There has been a lot of work on the convergence and asymptotic behavior of the
nonstationary annealing chain when $T_k \to 0$ [4]-[9].
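
The simulation procedure just described can be sketched as follows. This is a minimal illustration, not code from the references: the state space is taken to be a cycle with a symmetric nearest-neighbour proposal ($q(i,j) = 1/2$ for the two neighbours of $i$), and the cooling schedule $T_k = c/\log k$ is an assumption made only so that the example runs. The names `acceptance`, `metropolis_step` and `anneal` are hypothetical.

```python
import math
import random


def acceptance(u_i, u_j, temp):
    """s_k(i, j) = exp(-[U(j) - U(i)]_+ / T_k), as in (3.2)."""
    return math.exp(-max(u_j - u_i, 0.0) / temp)


def metropolis_step(i, energy, temp, rng):
    """One transition of (3.1) on a cycle with a symmetric nearest-neighbour proposal."""
    n = len(energy)
    j = (i + rng.choice((-1, 1))) % n   # candidate state, drawn with probability 1/2 each
    theta = rng.random()                # independent uniform variable
    # Accept the candidate when s_k(i, j) > theta; otherwise stay at i.
    return j if acceptance(energy[i], energy[j], temp) > theta else i


def anneal(energy, n_steps, c=1.0, seed=0):
    """Run the nonstationary chain with the assumed schedule T_k = c / log k."""
    rng = random.Random(seed)
    state = 0
    for k in range(2, n_steps + 2):
        state = metropolis_step(state, energy, c / math.log(k), rng)
    return state


energies = [2.0, 1.0, 3.0, 0.0, 2.5, 1.5]
last_state = anneal(energies, 1000)
```

Downhill candidates are always accepted in this sketch, since $[U(j) - U(i)]_+ = 0$ gives $s_k(i,j) = 1$; uphill candidates are accepted with probability decreasing in the energy gap and increasing in the temperature.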
We next generalize the finite-state Metropolis-type annealing algorithm (3.1), (3.2)
to a general state space. Assume that the state space $\Sigma$ is a $\sigma$-finite measure space
$(\Sigma, \mathcal{A}, \mu)$. Let $U(\cdot)$ be a real-valued measurable function on $\Sigma$ and let $\{T_k\}$ be as
above. Let $q(x,y)$ be a stationary transition probability density w.r.t. $\mu$ from $x$ to $y$, for
$x,y \in \Sigma$. The one-step transition probability at time $k$ for the general-state Metropolis-type
annealing chain $\{X_k\}$ is given by

$$P\{X_{k+1} \in A \mid X_k = x\} = \int_A q(x,y)\,s_k(x,y)\,d\mu(y) + r_k(x)1_A(x) \tag{3.3}$$

where

$$r_k(x) = 1 - \int q(x,y)\,s_k(x,y)\,d\mu(y), \tag{3.4}$$

and

$$s_k(x,y) = \exp\left( -\frac{[U(y) - U(x)]_+}{T_k} \right). \tag{3.5}$$

Note that if $\mu$ does not have an atom at $x$, then $r_k(x)$ is the self transition probability
starting at state $x$ at time $k$. Also note that (3.3)-(3.5) reduces to (3.1), (3.2) when the
state space is finite and $\mu$ is counting measure. The general-state chain may be
interpreted (and simulated) similarly to the finite-state chain: here, $q(x,y)$ is a conditional
probability density for generating a candidate state $\tilde{X}_k = y$ given the current state
$X_k = x$. Suppose that the stochastic transition function $Q(x,A) = \int_A q(x,y)\,d\mu(y)$ is
$\mu$-symmetric and irreducible, and the temperature $T_k$ is fixed at a constant $T > 0$. Then,
similarly to the finite-state case, it can be shown that the resulting stationary Markov
chain has a $\mu$-a.e. unique invariant Gibbs distribution with density $\propto \exp(-U(x)/T)$,
and furthermore, if a certain condition due to Doeblin [21] is satisfied, converges to this
Gibbs distribution as $k \to \infty$. There has been almost no work on the convergence and
asymptotic behavior of the nonstationary annealing chain when $T_k \to 0$, although when
$\Sigma$ is a compact metric space one would expect the behavior to be similar to when $\Sigma$ is
finite.


We next specialize the general-state Metropolis-type annealing algorithm (3.3)-(3.5)
to a $d$-dimensional Euclidean state space. Actually the Metropolis-type annealing chain
we shall consider is not exactly a specialization of the general-state chain described
above. Motivated by our desire to show convergence of the chain by writing it in the
form of the modified stochastic gradient algorithm (2.1), we are led to choosing a
nonstationary Gaussian transition density

$$q_k(x,y) = \frac{1}{(2\pi b_k^2(|x|^2 \vee 1))^{d/2}} \exp\left( -\frac{1}{2} \frac{|y - x|^2}{b_k^2(|x|^2 \vee 1)} \right) \tag{3.6}$$

and a state dependent temperature sequence

$$T_k(x) = \frac{b_k^2(|x|^2 \vee 1)}{2a_k} = \frac{\text{const.}(|x|^2 \vee 1)}{\log\log k}. \tag{3.7}$$

The choice of the transition density is clear, given we want to write the chain in the
form of (2.1). The choice of the temperature sequence is based on the following
considerations. Ignore for the moment the dependence on $|x|$ and examine the modified
stochastic gradient algorithm (1.3) and the associated diffusion (1.1). If we view (1.3) as
a sampled version of (1.1) with sampling intervals $a_k$ and sampling times $t_k = \sum_{n=0}^{k-1} a_n$,
then we have corresponding sampled temperatures $T(t_k) = c^2(t_k)/2$, and it is
straightforward to check that if $C = B/A$ then

$$T_k = \frac{b_k^2}{2a_k} \sim \frac{c^2(t_k)}{2} = T(t_k) \quad \text{as } k \to \infty.$$
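
The asymptotic agreement of the chain temperature $b_k^2/(2a_k)$ with the sampled diffusion temperature $c^2(t_k)/2$ can be checked numerically. The sketch below is only an illustration under explicit assumptions: it takes $a_n = A/n$ for every $n \ge 1$ (the text only specifies $a_k$ for $k$ large), $b_k^2 = B/(k \log\log k)$, $c^2(t) = C/\log t$ with $C = B/A$, and sampling times $t_k = \sum_{n=1}^{k-1} a_n$. The function names are hypothetical.

```python
import math


def chain_temperature(k, A, B):
    """T_k = b_k^2 / (2 a_k) with a_k = A / k and b_k^2 = B / (k log log k)."""
    return B / (2.0 * A * math.log(math.log(k)))


def diffusion_temperature(t, C):
    """T(t) = c^2(t) / 2 with c^2(t) = C / log t."""
    return C / (2.0 * math.log(t))


def sampling_time(k, A):
    """t_k = a_1 + ... + a_{k-1}, assuming a_n = A / n for every n >= 1."""
    return A * sum(1.0 / n for n in range(1, k))


A, B = 1.0, 2.0
# Ratio T_k / T(t_k) for increasing k; it equals log(t_k) / log log k when C = B / A.
ratios = [
    chain_temperature(k, A, B) / diffusion_temperature(sampling_time(k, A), B / A)
    for k in (10**3, 10**4, 10**5)
]
```

Because $t_k$ grows like $A \log k$, the ratio behaves like $\log(A \log k)/\log\log k$ and approaches one only at a doubly logarithmic rate, so the agreement is visible but slow.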


Finally, the fundamental reason that the $|x|$ dependence is needed in both (3.6), (3.7) is
that in order to establish tightness of the annealing chain by writing the chain in either
the form of (2.3) or (2.4) we need a condition like

$$|\psi_k(x)| \ge \text{const.}\,|x|, \quad |x| \text{ large}, \tag{3.8}$$

for a suitable choice of $\psi_k(\cdot)$. In words, the annealing chain must generate a drift
(towards the origin) at least proportional to the distance from the origin. To
accomplish this we include the dependence on $|x|$ in (3.6), (3.7) and then write the chain in
the form of (2.3) to establish tightness. This discussion leads us to the following
continuous-state Metropolis-type Markov chain annealing algorithm.

Metropolis-type Annealing Algorithm #1:


Let $\{X_k\}$ be a Markov chain with 1-step transition probability at time $k$ given by†

$$P\{X_{k+1} \in A \mid X_k = x\} = \int_A s_k(x,y)\,dN(x, b_k^2(|x|^2 \vee 1)I)(y) + r_k(x)1_A(x) \tag{3.9}$$

where

$$r_k(x) = 1 - \int s_k(x,y)\,dN(x, b_k^2(|x|^2 \vee 1)I)(y) \tag{3.10}$$

and

$$s_k(x,y) = \exp\left( -\frac{2a_k}{b_k^2} \frac{[U(y) - U(x)]_+}{|x|^2 \vee 1} \right). \tag{3.11}$$
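
One transition of the chain (3.9)-(3.11) can be simulated by drawing a Gaussian candidate and accepting it with probability $s_k(x,y)$. The following is a minimal sketch under stated assumptions, not a definitive implementation: it uses $a_k = A/k$ and $b_k^2 = B/(k \log\log k)$ for all $k \ge k_0$ (the text specifies these only for $k$ large), and the names `step_algorithm1`, `anneal_algorithm1`, `k0` and the quadratic test energy are hypothetical.

```python
import math
import random


def step_algorithm1(x, k, U, A, B, rng):
    """One transition of (3.9)-(3.11): Gaussian candidate, Metropolis-type acceptance."""
    a_k = A / k
    b2_k = B / (k * math.log(math.log(k)))       # b_k^2, requires k >= 3
    scale2 = max(sum(c * c for c in x), 1.0)     # |x|^2 v 1
    sd = math.sqrt(b2_k * scale2)                # std. dev. of N(x, b_k^2 (|x|^2 v 1) I)
    y = [c + sd * rng.gauss(0.0, 1.0) for c in x]
    # Acceptance probability s_k(x, y) from (3.11).
    s = math.exp(-2.0 * a_k * max(U(y) - U(x), 0.0) / (b2_k * scale2))
    return y if rng.random() < s else x


def anneal_algorithm1(U, x0, A=1.0, B=1.0, k0=16, n_steps=500, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for k in range(k0, k0 + n_steps):
        x = step_algorithm1(x, k, U, A, B, rng)
    return x


def quadratic(z):
    # Illustrative energy U(z) = |z|^2 / 2 (an assumption, not from the paper).
    return 0.5 * sum(c * c for c in z)


result = anneal_algorithm1(quadratic, [2.0, -1.0])
```

Note that the sketch evaluates only $U(\cdot)$, never $\nabla U(\cdot)$; any gradient information enters implicitly through the acceptance step.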


Theorem 3: Assume (A1), (A2) hold and also

$$\sup_x |HU(x)| < \infty. \tag{3.12}$$

Let $\{X_k\}$ be the Markov chain with transition probability given by (3.9)-(3.11). Then
for $B/A > C_0$ and any bounded continuous function $f(\cdot)$ on $\mathbb{R}^d$

$$\lim_{k \to \infty} E_{0,x}\{f(X_k)\} = \pi(f) \tag{3.13}$$

uniformly for $x$ in a compact set.

The proof of Theorem 3 is in Section 4.1. Observe that the condition (3.12) can be
rather restrictive. It implies, along with (A1), that there exist constants $M_1$, $M_2$ such
that

$$M_1 |x| \le |\nabla U(x)| \le M_2 |x|, \quad |x| \text{ large}.$$

It turns out that the lower bound on $|\nabla U(x)|$ is essential but the upper bound on
$|\nabla U(x)|$ can be weakened by using a suitable modification of (3.11) as follows.

† The results are not changed if we replace $|x|^2 \vee 1$ by $|x|^2 \vee a$ or $|x|^2 + a$ for $a > 1$.

Metropolis-type Annealing Algorithm #2:

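Let $\{X_k\}$ be a Markov chain with 1-step transition probability at time $k$ given by†

$$P\{X_{k+1} \in A \mid X_k = x\} = \int_A s_k(x,y)\,dN(x, b_k^2(|x|^2 \vee 1)I)(y) + r_k(x)1_A(x) \tag{3.14}$$

where

$$r_k(x) = 1 - \int s_k(x,y)\,dN(x, b_k^2(|x|^2 \vee 1)I)(y) \tag{3.15}$$

and

$$s_k(x,y) = \begin{cases} \exp\left( -\dfrac{2a_k}{b_k^2} \dfrac{[U(y) - U(x)]_+}{|x|^2 \vee 1} \right) & \text{if } U(x) \le \dfrac{|x|^2 \vee 1}{a_k^{\gamma}}, \\[2ex] \exp\left( -\dfrac{2a_k}{b_k^2} \dfrac{[|y|^2 - |x|^2]_+}{|x|^2 \vee 1} \right) & \text{if } U(x) > \dfrac{|x|^2 \vee 1}{a_k^{\gamma}}, \end{cases} \tag{3.16}$$

and $\gamma > 0$.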


Theorem 4: Assume (A1), (A2) hold and also

$$\inf_{\delta > 0} \limsup_{|x| \to \infty} \sup_{|y - x| < \delta |x|} |HU(y)| < \infty. \tag{3.17}$$

Let $\{X_k\}$ be the Markov chain with transition probability given by (3.14)-(3.16) with
$0 < \gamma < 1/4$. Then for $B/A > C_0$ and any bounded continuous function $f(\cdot)$ on $\mathbb{R}^d$

$$\lim_{k \to \infty} E_{0,x}\{f(X_k)\} = \pi(f) \tag{3.18}$$

uniformly for $x$ in a compact set.

The proof of Theorem 4 is in Section 4.2. Observe that the condition (3.17) (and
also (A1)) will be satisfied if $U(x) \sim \text{const.}\,|x|^p$ and $HU(x) = O(|x|^{p-2})$ as $|x| \to \infty$
for any $p > 2$. Note that if $K$ is any fixed compact set, $X_k \in K$, and $k$ is very large, then
(3.16) and (3.11) coincide. Note also that (3.16), like (3.11), only uses measurements of
$U(\cdot)$ (and not $\nabla U(\cdot)$).

† The results are not changed if we replace $|x|^2 \vee 1$ by $|x|^2 \vee a$ or $|x|^2 + a$ for $a > 1$.

4. PROOFS OF THEOREMS 3 AND 4


In the sequel $c_1, c_2, \ldots$ will denote positive constants whose value may change from
proof to proof. We will need the following lemma.

Lemma 1: Assume that $V(\cdot)$ is a $C^2$ function from $\mathbb{R}^d$ to $\mathbb{R}$. Let

$$s(x,y) = \exp(-\lambda [V(y) - V(x)]_+)$$

and

$$\hat{s}(x,y) = \exp(-\lambda [\langle \nabla V(x), y - x \rangle]_+)$$

where $\lambda > 0$. Then

$$|s(x,y) - \hat{s}(x,y)| \le \lambda \sup_{\epsilon \in (0,1)} |HV(x + \epsilon(y - x))|\, |y - x|^2$$

for all $x, y \in \mathbb{R}^d$.

Proof: Let

$$f(x,y) = V(y) - V(x) - \langle \nabla V(x), y - x \rangle.$$

Then by the 2nd order Taylor Theorem

$$|f(x,y)| \le \sup_{\epsilon \in (0,1)} |HV(x + \epsilon(y - x))|\, |y - x|^2. \tag{4.1}$$

By separately considering the four cases corresponding to the possible signs of
$V(y) - V(x)$ and $\langle \nabla V(x), y - x \rangle$, it can be shown that

$$|s(x,y) - \hat{s}(x,y)| \le 1 - \exp(-\lambda |f(x,y)|) \le \lambda |f(x,y)|. \tag{4.2}$$

Combining (4.1) and (4.2) completes the proof.

□
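
The bound of Lemma 1 can be spot-checked numerically in one dimension. The sketch below is only an illustration with assumed ingredients: it takes $V = \sin$, so that $\nabla V = \cos$ and $\sup |HV| \le 1$, and samples pairs $(x,y)$ at random. The function names and the sampling range are hypothetical choices, not part of the proof.

```python
import math
import random


def s_exact(x, y, lam):
    """s(x, y) = exp(-lam * [V(y) - V(x)]_+) with V = sin."""
    return math.exp(-lam * max(math.sin(y) - math.sin(x), 0.0))


def s_linearized(x, y, lam):
    """Linearized version exp(-lam * [<grad V(x), y - x>]_+) with grad V = cos."""
    return math.exp(-lam * max(math.cos(x) * (y - x), 0.0))


def worst_excess(n_samples, lam, seed=0):
    """Largest value of |s - s_linearized| minus the Lemma 1 bound lam * sup|HV| * |y - x|^2."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(n_samples):
        x = rng.uniform(-5.0, 5.0)
        y = rng.uniform(-5.0, 5.0)
        gap = abs(s_exact(x, y, lam) - s_linearized(x, y, lam))
        bound = lam * 1.0 * (y - x) ** 2     # sup |HV| <= 1 for V = sin
        worst = max(worst, gap - bound)
    return worst


excess = worst_excess(10000, 2.0)
```

A nonpositive `excess` (up to rounding) is consistent with the lemma for the sampled pairs; it is a sanity check, not a substitute for the proof.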


4.1. Proof of Theorem 3

We write

$$X_{k+1} = X_k - a_k(\nabla U(X_k) + \xi_k) + b_k(|X_k| \vee 1)W_k \tag{4.3}$$

(this defines $\xi_k$) and apply Theorem 1 to show that if $\{X_k^x : k \ge 0, x \in K\}$ is tight for $K$
compact then (3.13) is true. We further let $\psi_k(x) = \nabla U(x)$ and $\eta_k = \xi_k$ and apply
Theorem 2 to show that $\{X_k^x : k \ge 0, x \in K\}$ is in fact tight for $K$ compact and (3.13) is
in fact true.

We first show that we can find a version of $\{X_k\}$ in the form

$$X_{k+1} = X_k + b_k(|X_k| \vee 1)\sigma_k W_k. \tag{4.4}$$

To do this we inductively define the required random variables as follows.
Assume $X_0, W_0, \ldots, W_{k-1}, \xi_0, \ldots, \xi_{k-1}$ have been defined. Let
$\mathcal{F}_k = \sigma(X_0, W_0, \ldots, W_{k-1}, \xi_0, \ldots, \xi_{k-1})$. Let $W_k$ be a standard $d$-dimensional Gaussian
random variable independent of $\mathcal{F}_k$ and let $\sigma_k$ be a $\{0,1\}$-valued random variable with

$$P\{\sigma_k = 1 \mid \mathcal{F}_k, W_k\} = s_k(X_k, X_k + b_k(|X_k| \vee 1)W_k). \tag{4.5}$$

Note that $P\{\sigma_k = i \mid \mathcal{F}_k, W_k\} = P\{\sigma_k = i \mid X_k, W_k\}$. Using (4.5) it is easy to check that
(4.4) is a Markov chain which has transition probability given by (3.9)-(3.11). Hence
(4.4) is indeed a version of $\{X_k\}$ and we always deal with this version in the sequel.

Now comparing (4.3) and (4.4) we have

$$\xi_k = -\nabla U(X_k) + \frac{b_k}{a_k}(|X_k| \vee 1)(1 - \sigma_k)W_k \tag{4.6}$$

and in particular $\xi_k$ is a function of $X_k$, $W_k$ and $\sigma_k$. Note that since $X_k$ is
$\mathcal{F}_k$-measurable, $W_k$ is independent of $\mathcal{F}_k$, and
$P\{\sigma_k = i \mid \mathcal{F}_k, W_k\} = P\{\sigma_k = i \mid X_k, W_k\}$, it follows that $W_k$ is
independent of $X_k$. Hence
$P\{\xi_k \in A \mid \mathcal{F}_k\} = P\{\xi_k \in A \mid X_k\}$. We will use these facts below.

The following lemma gives the crucial estimates for $E\{|\xi_k|^2 \mid \mathcal{F}_k\}$ and
$|E\{\xi_k \mid \mathcal{F}_k\}|$.

Lemma 2: There exists $L > 0$ such that

a) $|E\{\xi_k \mid \mathcal{F}_k\}| \le L \dfrac{a_k}{b_k}(|X_k| \vee 1)$ w.p.1

b) $E\{|\xi_k|^2 \mid \mathcal{F}_k\} \le L \dfrac{b_k}{a_k}(|X_k|^2 \vee 1)$ w.p.1

Assume that Lemma 2 is true. Then (A3) is satisfied with $\alpha = -\frac{1}{2} > -1$ and
$0 < \beta < \frac{1}{2}$, and (B1), (B2) are satisfied for the same choice of $\alpha$ and $\beta$ and for
$\gamma_1 = \gamma_2 = 0$. Hence Theorems 1 and 2 apply and Theorem 3 follows. It remains to
prove Lemma 2. We will use the following claim.

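Claim: Let $u \in \mathbb{R}^d$ with $|u| = 1$. Then

a) $\displaystyle\int_{0 \le \langle u, w \rangle \le \delta} dN(0,I)(w) = O(\delta)$

b) $\displaystyle\int_{0 \le \langle u, w \rangle \le \delta} w\,dN(0,I)(w) = O(\delta^2)$

c) $\displaystyle\int_{0 \le \langle u, w \rangle \le \delta} w \otimes w\,dN(0,I)(w) = O(\delta)$

Proof: Let $u_1 = u$ and extend $u_1$ to an orthonormal basis $\{u_1, \ldots, u_d\}$ for $\mathbb{R}^d$. Then by
changing variables (rotation) and using the Mean Value Theorem we get

a) $\displaystyle\int_{0 \le \langle u, w \rangle \le \delta} dN(0,I)(w) = \int_0^\delta \frac{1}{(2\pi)^{1/2}} \exp\left( -\frac{v^2}{2} \right) dv = O(\delta)$

b) $\displaystyle\int_{0 \le \langle u, w \rangle \le \delta} w\,dN(0,I)(w) = u_1 \int_0^\delta \frac{v}{(2\pi)^{1/2}} \exp\left( -\frac{v^2}{2} \right) dv = O(\delta^2)$

c) $\displaystyle\int_{0 \le \langle u, w \rangle \le \delta} w \otimes w\,dN(0,I)(w) = u_1 \otimes u_1 \int_0^\delta \frac{v^2}{(2\pi)^{1/2}} \exp\left( -\frac{v^2}{2} \right) dv + \sum_{i=2}^{d} u_i \otimes u_i \int_0^\delta \frac{1}{(2\pi)^{1/2}} \exp\left( -\frac{v^2}{2} \right) dv = O(\delta)$

□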
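
Parts a) and b) of the Claim reduce to one-dimensional Gaussian integrals with closed forms, which makes the orders $O(\delta)$ and $O(\delta^2)$ easy to verify numerically. The sketch below is an illustration only; the explicit constants $1/\sqrt{2\pi}$ and $1/(2\sqrt{2\pi})$ are elementary bounds supplied here (from $e^{-v^2/2} \le 1$ and $1 - e^{-t} \le t$), and the function names are hypothetical.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def slab_mass(delta):
    """Integral over [0, delta] of the standard normal density (part a)."""
    return 0.5 * math.erf(delta / math.sqrt(2.0))


def slab_first_moment(delta):
    """Integral over [0, delta] of v times the standard normal density (part b)."""
    return (1.0 - math.exp(-delta * delta / 2.0)) / SQRT_2PI


deltas = [1e-3, 1e-2, 1e-1, 0.5, 1.0]
# Each pair is (integral, elementary upper bound of the claimed order).
checks_a = [(slab_mass(d), d / SQRT_2PI) for d in deltas]
checks_b = [(slab_first_moment(d), d * d / (2.0 * SQRT_2PI)) for d in deltas]
```

For small `delta` both bounds are nearly attained, which shows that the orders stated in a) and b) cannot be improved in general.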


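Proof of Lemma 2a):

Using (4.6) and the fact that $P\{\xi_k \in A \mid \mathcal{F}_k\} = P\{\xi_k \in A \mid X_k\}$, and $W_k$ is independent of $X_k$, we have (w.p.1)

\begin{align*}
E\{\xi_k \mid \mathcal{F}_k\} &= E\{\xi_k \mid X_k\} \\
&= -\nabla U(X_k) + \frac{b_k}{a_k}(|X_k| \vee 1) E\{(1 - \sigma_k) W_k \mid X_k\} \\
&= -\nabla U(X_k) - \frac{b_k}{a_k}(|X_k| \vee 1) E\{W_k E\{\sigma_k \mid X_k, W_k\} \mid X_k\} \\
&= -\nabla U(X_k) - \frac{b_k}{a_k}(|X_k| \vee 1) E\{W_k P\{\sigma_k = 1 \mid X_k, W_k\} \mid X_k\} \\
&= -\nabla U(X_k) - \frac{b_k}{a_k}(|X_k| \vee 1) E\{W_k P\{\sigma_k = 1 \mid X_k, W_k\}\} .
\end{align*}
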


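Henceforth we condition on $X_k = x$ where $|x| > 1$; the case where $|x| \le 1$ is similar. Hence using (4.5)

\[ E\{\xi_k \mid X_k = x\} = -\nabla U(x) - \frac{b_k}{a_k}|x| \int w \, s_k(x, x + b_k|x|w) \, dN(0,I)(w) . \]

Let

\[ \tilde{s}_k(x,y) = \exp\left[-\frac{2a_k}{b_k^2} \frac{[\langle \nabla U(x), y - x \rangle]^+}{|x|^2}\right] \tag{4.7} \]

and $\hat{s}_k(x,y) = s_k(x,y) - \tilde{s}_k(x,y)$. Then by (3.12) and Lemma 1

\[ |\hat{s}_k(x,y)| \le c_1 \frac{a_k}{b_k^2} \frac{|y - x|^2}{|x|^2} . \tag{4.8} \]
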


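Hence

\begin{align*}
E\{\xi_k \mid X_k = x\} &= -\nabla U(x) - \frac{b_k}{a_k}|x| \int w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) - \frac{b_k}{a_k}|x| \int w \, \hat{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) \\
&= -\nabla U(x) - \frac{b_k}{a_k}|x| \int w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O(b_k|x|) \tag{4.9} \\
&= -\nabla U(x) - \frac{b_k}{a_k}|x| \int_{\langle \nabla U(x), w \rangle \le 0} w \, dN(0,I)(w) \\
&\quad - \frac{b_k}{a_k}|x| \int_{\langle \nabla U(x), w \rangle > 0} w \exp\left[-\frac{2a_k}{b_k} \frac{\langle \nabla U(x), w \rangle}{|x|}\right] dN(0,I)(w) + O(b_k|x|) . \tag{4.10}
\end{align*}
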


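Clearly

\[ E\{\xi_k \mid X_k = x\} = O(b_k|x|) \tag{4.11} \]

for $x$ such that $\nabla U(x) = 0$. Henceforth we assume that $\nabla U(x) \ne 0$. Let $\widehat{\nabla U}(x) = \nabla U(x)/|\nabla U(x)|$. Completing the square in the second integral in (4.10) we get

\begin{align*}
E\{\xi_k \mid X_k = x\} &= -\nabla U(x) - \frac{b_k}{a_k}|x| \int_{\langle \widehat{\nabla U}(x), w \rangle \le 0} w \, dN(0,I)(w) \\
&\quad - \frac{b_k}{a_k}|x| \exp\left[2\left(\frac{a_k}{b_k}\right)^2 \frac{|\nabla U(x)|^2}{|x|^2}\right] \int_{\langle \widehat{\nabla U}(x), w \rangle > 0} w \, dN\left(-\frac{2a_k}{b_k}\frac{\nabla U(x)}{|x|}, I\right)(w) + O(b_k|x|) . \tag{4.12}
\end{align*}

Now by (3.12) $|\nabla U(x)| = O(|x|)$ and so

\[ \exp\left[2\left(\frac{a_k}{b_k}\right)^2 \frac{|\nabla U(x)|^2}{|x|^2}\right] = 1 + O\left(\left(\frac{a_k}{b_k}\right)^2\right) . \tag{4.13} \]
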
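For readers who want the completing-the-square step spelled out, the following worked identity is a sketch of the computation. The shorthand $c$ is introduced here for illustration only and does not appear in the paper.

```latex
% Shorthand (illustrative): c = 2 (a_k / b_k) \nabla U(x) / |x|, a vector in R^d.
\[
  \exp\bigl[-\langle c, w \rangle\bigr]\,
  \frac{1}{(2\pi)^{d/2}} \exp\Bigl[-\frac{|w|^2}{2}\Bigr]
  = \exp\Bigl[\frac{|c|^2}{2}\Bigr]\,
    \frac{1}{(2\pi)^{d/2}} \exp\Bigl[-\frac{|w + c|^2}{2}\Bigr],
\]
% since |w + c|^2 = |w|^2 + 2 <c, w> + |c|^2.  With the choice of c above,
\[
  \frac{|c|^2}{2} = 2 \Bigl(\frac{a_k}{b_k}\Bigr)^2 \frac{|\nabla U(x)|^2}{|x|^2},
\]
% and the right-hand factor is the density of N(-c, I).
```

The exponential prefactor is the one estimated in (4.13), and the shifted Gaussian $N(-c, I)$ is what the subsequent change of variables from $w + c$ to $w$ recentres.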


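Substituting (4.13) into (4.12), using $a_k/b_k = O(1)$ and $|\nabla U(x)| = O(|x|)$, and changing variables from $w + 2(a_k/b_k)(\nabla U(x)/|x|)$ to $w$ gives

\begin{align*}
E\{\xi_k \mid X_k = x\} &= -\nabla U(x) - \frac{b_k}{a_k}|x| \int_{\langle \widehat{\nabla U}(x), w \rangle \le 0} w \, dN(0,I)(w) \\
&\quad - \frac{b_k}{a_k}|x| \int_{\langle \widehat{\nabla U}(x), w \rangle > O(a_k/b_k)} w \, dN(0,I)(w) + 2\nabla U(x) \int_{\langle \widehat{\nabla U}(x), w \rangle > O(a_k/b_k)} dN(0,I)(w) \\
&\quad + O\left(\frac{a_k}{b_k}|x|\right) + O(b_k|x|) \\
&= \frac{b_k}{a_k}|x| \int_{0 < \langle \widehat{\nabla U}(x), w \rangle \le O(a_k/b_k)} w \, dN(0,I)(w) - 2\nabla U(x) \int_{0 < \langle \widehat{\nabla U}(x), w \rangle \le O(a_k/b_k)} dN(0,I)(w) \\
&\quad + O\left(\frac{a_k}{b_k}|x|\right) . \tag{4.14}
\end{align*}
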


Hence by the Claim parts a), b) and again using $|\nabla U(x)| = O(|x|)$ we have

\[ E\{\xi_k \mid X_k = x\} = O\left(\frac{a_k}{b_k}|x|\right) . \tag{4.15} \]

Combining (4.11) and (4.15) completes the proof of Lemma 2a).
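The order of magnitude in Lemma 2a) can be inspected numerically in a simple case. The Python sketch below estimates $E\{\xi_k \mid X_k = x\}$ by trapezoid quadrature in dimension one and compares it with $(a_k/b_k)|x|$. Everything specific here is an assumption made for illustration: the objective $U(x) = x^2/2$, the constants $A = B = 1$ in $a_k = 1/k$ and $b_k^2 = 1/(k \log\log k)$, the assumed form of the acceptance probability $s_k$ (equation (3.11) is not restated in this section), and the function names.

```python
import math

def phi(w):
    # Standard normal density.
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)

def mean_xi(x, k, n=20000, w_max=10.0):
    """Trapezoid estimate of E{xi_k | X_k = x} for d = 1, U(x) = x^2/2, |x| > 1.

    Returns (estimate, a_k, b_k).
    """
    a = 1.0 / k                                        # assumed A = 1
    b = 1.0 / math.sqrt(k * math.log(math.log(k)))     # assumed B = 1
    h = w_max / n
    total = 0.0
    for j in range(-n, n + 1):
        w = j * h
        y = x + b * abs(x) * w
        dU = 0.5 * (y * y - x * x)
        # ASSUMED form of s_k(x, y) for |x| > 1.
        s = math.exp(-(2.0 * a / b ** 2) * max(dU, 0.0) / (x * x))
        weight = 0.5 if abs(j) == n else 1.0
        total += weight * w * s * phi(w) * h
    # E{xi | x} = -U'(x) - (b/a)|x| * integral of w s_k dN(0,1)(w), with U'(x) = x.
    return -x - (b / a) * abs(x) * total, a, b

if __name__ == "__main__":
    for k in (10**3, 10**4, 10**5):
        m, a, b = mean_xi(2.0, k)
        print(k, abs(m) / ((a / b) * 2.0))
```

If the bound of Lemma 2a) holds in this example, the printed ratio should remain bounded as `k` grows rather than drift upward.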


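Proof of Lemma 2b):

Using (4.6) and the fact that $P\{\xi_k \in A \mid \mathcal{F}_k\} = P\{\xi_k \in A \mid X_k\}$, $W_k$ is independent of $X_k$, and $|\nabla U(x)| = O(|x|)$ we have (w.p.1)

\begin{align*}
E\{\xi_k \otimes \xi_k \mid \mathcal{F}_k\} &= E\{\xi_k \otimes \xi_k \mid X_k\} \\
&= \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) E\{((1 - \sigma_k)W_k) \otimes ((1 - \sigma_k)W_k) \mid X_k\} + e_k(X_k) \\
&= \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) I - \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) E\{W_k \otimes W_k E\{\sigma_k \mid X_k, W_k\} \mid X_k\} + e_k(X_k) \\
&= \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) I - \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) E\{W_k \otimes W_k P\{\sigma_k = 1 \mid X_k, W_k\} \mid X_k\} + e_k(X_k) \\
&= \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) I - \left(\frac{b_k}{a_k}\right)^2 (|X_k|^2 \vee 1) E\{W_k \otimes W_k P\{\sigma_k = 1 \mid X_k, W_k\}\} + e_k(X_k)
\end{align*}

where

\[ e_k(X_k) = O\left(\frac{b_k}{a_k}(|X_k|^2 \vee 1)\right) . \]
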


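Henceforth we condition on $X_k = x$ where $|x| > 1$; the case where $|x| \le 1$ is similar. Hence using (4.5)

\[ E\{\xi_k \otimes \xi_k \mid X_k = x\} = \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int w \otimes w \, s_k(x, x + b_k|x|w) \, dN(0,I)(w) + O\left(\frac{b_k}{a_k}|x|^2\right) . \]

Let $\tilde{s}_k(x,y)$ be given by (4.7) and $\hat{s}_k(x,y) = s_k(x,y) - \tilde{s}_k(x,y)$. Then using (4.8)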
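\begin{align*}
E\{\xi_k \otimes \xi_k \mid X_k = x\} &= \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int w \otimes w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) \\
&\quad - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int w \otimes w \, \hat{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O\left(\frac{b_k}{a_k}|x|^2\right) \\
&= \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int w \otimes w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O\left(\frac{b_k}{a_k}|x|^2\right) \tag{4.16} \\
&= \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int_{\langle \nabla U(x), w \rangle \le 0} w \otimes w \, dN(0,I)(w) \\
&\quad - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int_{\langle \nabla U(x), w \rangle > 0} w \otimes w \exp\left[-\frac{2a_k}{b_k} \frac{\langle \nabla U(x), w \rangle}{|x|}\right] dN(0,I)(w) + O\left(\frac{b_k}{a_k}|x|^2\right) . \tag{4.17}
\end{align*}
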


Clearly

\[ E\{\xi_k \otimes \xi_k \mid X_k = x\} = O\left(\frac{b_k}{a_k}|x|^2\right) \tag{4.18} \]

for $x$ such that $\nabla U(x) = 0$. Henceforth we assume that $\nabla U(x) \ne 0$. Let $\widehat{\nabla U}(x) = \nabla U(x)/|\nabla U(x)|$. Completing the square in the second integral in (4.17), and proceeding similarly to the derivation of (4.14) in the proof of Lemma 2a) we have
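\begin{align*}
E\{\xi_k \otimes \xi_k \mid X_k = x\} &= \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int_{\langle \widehat{\nabla U}(x), w \rangle \le 0} w \otimes w \, dN(0,I)(w) \\
&\quad - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \exp\left[2\left(\frac{a_k}{b_k}\right)^2 \frac{|\nabla U(x)|^2}{|x|^2}\right] \int_{\langle \widehat{\nabla U}(x), w \rangle > 0} w \otimes w \, dN\left(-\frac{2a_k}{b_k}\frac{\nabla U(x)}{|x|}, I\right)(w) + O\left(\frac{b_k}{a_k}|x|^2\right) \\
&= \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int_{\langle \widehat{\nabla U}(x), w \rangle \le 0} w \otimes w \, dN(0,I)(w) \\
&\quad - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int_{\langle \widehat{\nabla U}(x), w \rangle > O(a_k/b_k)} w \otimes w \, dN(0,I)(w) + O(|x|^2) + O\left(\frac{b_k}{a_k}|x|^2\right) \\
&= \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int_{0 < \langle \widehat{\nabla U}(x), w \rangle \le O(a_k/b_k)} w \otimes w \, dN(0,I)(w) + O\left(\frac{b_k}{a_k}|x|^2\right) .
\end{align*}

Hence by the Claim part c)

\[ E\{\xi_k \otimes \xi_k \mid X_k = x\} = O\left(\frac{b_k}{a_k}|x|^2\right) . \tag{4.19} \]

Combining (4.18) and (4.19) and using the fact that $|\xi_k \otimes \xi_k| = |\xi_k|^2$ completes the proof of Lemma 2b).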
Remark: In Figure 1 we demonstrate the type of approximations used in the proof of Theorem 3. In Figure 1(i) we show the transition density $p_k(x,y)$ for the Markov chain with transition probability given by (3.9)-(3.11); in Figure 1(ii) we show the transition density $p_k(x,y)$ for the same Markov chain but using $\tilde{s}_k(x,y)$ (eqn. (4.7)) in place of $s_k(x,y)$ (eqn. (3.11)); and in Figure 1(iii) we show the transition density $p_k(x,y)$ for the Markov chain of (2.1) with $\xi_k = 0$. Note that the densities in Figures 1(i) and (ii) contain impulsive components associated with the positive probability of no transition. All three densities are "close" for sufficiently large $k$.

4.2.  Proof  of  Theorem  4 


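We write

\[ X_{k+1} = X_k - a_k(\nabla U(X_k) + \xi_k) + b_k(|X_k| \vee 1) W_k \]

(this defines $\xi_k$) and apply Theorem 1 to show that if $\{X_k^x : k \ge 0,\ x \in K\}$ is tight for $K$ compact then (3.18) is true. We further let

\[ \psi_k(x) = \begin{cases} \nabla U(x) & \text{if } U(x) \le \dfrac{|x|^2 \vee 1}{a_k^{\gamma}} \\[2ex] 2x & \text{if } U(x) > \dfrac{|x|^2 \vee 1}{a_k^{\gamma}} \end{cases} \]

and write

\[ X_{k+1} = X_k - a_k(\psi_k(X_k) + \eta_k) + b_k(|X_k| \vee 1) W_k \]

(this defines $\eta_k$) and apply Theorem 2 to show that $\{X_k^x : k \ge 0,\ x \in K\}$ is in fact tight for $K$ compact and (3.18) is in fact true.

The following lemmas give the crucial estimates for $E\{|\xi_k|^2 \mid \mathcal{F}_k\}$, $|E\{\xi_k \mid \mathcal{F}_k\}|$, $E\{|\eta_k|^2 \mid \mathcal{F}_k\}$, $|E\{\eta_k \mid \mathcal{F}_k\}|$ (compare with Lemma 2).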

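Lemma 3: Let $K$ be a compact subset of $\mathbb{R}^d$. Then there exists $L > 0$ such that

a) $|E\{\xi_k \mid \mathcal{F}_k\}| \le L \dfrac{a_k}{b_k}$ $\forall\, X_k \in K$, w.p.1

b) $E\{|\xi_k|^2 \mid \mathcal{F}_k\} \le L \dfrac{b_k}{a_k}$ $\forall\, X_k \in K$, w.p.1

Lemma 4: There exists $L > 0$ such that

a) $|E\{\eta_k \mid \mathcal{F}_k\}| \le L \dfrac{a_k^{1-2\gamma}}{b_k}(|X_k| \vee 1)$ w.p.1

b) $E\{|\eta_k|^2 \mid \mathcal{F}_k\} \le L \dfrac{b_k}{a_k^{1+\gamma}}(|X_k|^2 \vee 1)$ w.p.1
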


Assume that Lemmas 3 and 4 are true. Then (A3) is satisfied with $\alpha = -\frac{1}{2} - \gamma > -1$ and $0 < \beta < \frac{1}{2} - 2\gamma$, and (B1), (B2) are satisfied with $\alpha = -\frac{1}{2}$, $0 < \beta < \frac{1}{2}$, $\gamma_1 = \gamma$ and $\gamma_2 = 0$ (recall that we assume $0 < \gamma < \frac{1}{4}$). Hence Theorems 1 and 2 apply, and Theorem 4 follows. It remains to prove Lemmas 3 and 4.

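Proof of Lemma 3:

In the sequel we condition on $X_k = x$ where $x \in K$ and $|x| > 1$; the case where $|x| \le 1$ is similar. Let

\[ \tilde{s}_k(x,y) = \begin{cases} \exp\left[-\dfrac{2a_k}{b_k^2} \dfrac{[\langle \nabla U(x), y - x \rangle]^+}{|x|^2}\right] & \text{if } U(x) \le \dfrac{|x|^2}{a_k^{\gamma}} \\[3ex] \exp\left[-\dfrac{2a_k}{b_k^2} \dfrac{[\langle 2x, y - x \rangle]^+}{|x|^2}\right] & \text{if } U(x) > \dfrac{|x|^2}{a_k^{\gamma}} \end{cases} \tag{4.20} \]

and $\hat{s}_k(x,y) = s_k(x,y) - \tilde{s}_k(x,y)$. Using the fact that $HU(\cdot)$ is bounded on a compact we get for any fixed $\delta > 0$

\[ \sup_{\epsilon \in (0,1)} |HU(x + \epsilon(y - x))| \le \sup_{|z - x| < \delta|x|} |HU(z)| \le c_1 , \]

for all $|y - x| < \delta|x|$, and in particular the inequality holds when $U(z) = |z|^2$. Hence by considering the two cases where $U(x)$ is $\le$ or $>$ $|x|^2/a_k^{\gamma}$ and using Lemma 1 we get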
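\[ |\hat{s}_k(x,y)| \le c_2 \frac{a_k}{b_k^2} \frac{|y - x|^2}{|x|^2} , \qquad |y - x| < \delta|x| . \tag{4.21} \]

Note that (4.21) unlike (4.8) only holds for $|y - x| < \delta|x|$. Of course

\[ |\hat{s}_k(x,y)| \le 1 . \tag{4.22} \]

Using (4.21), (4.22) and a standard estimate for the tail probability of a Gaussian random variable we get for $i \ge 0$

\begin{align*}
\int |w|^i \, |\hat{s}_k(x, x + b_k|x|w)| \, dN(0,I)(w) &\le \int_{|w| < \delta/b_k} |w|^i \, |\hat{s}_k(x, x + b_k|x|w)| \, dN(0,I)(w) \\
&\quad + \int_{|w| \ge \delta/b_k} |w|^i \, |\hat{s}_k(x, x + b_k|x|w)| \, dN(0,I)(w) \\
&\le c_3 a_k + c_3 \exp\left(-\frac{c_4}{b_k^2}\right) \\
&= O(a_k) . \tag{4.23}
\end{align*}
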


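Using (4.23) we get similarly to the derivation of (4.9) and (4.16)

\[ E\{\xi_k \mid X_k = x\} = -\nabla U(x) - \frac{b_k}{a_k}|x| \int w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O(b_k) \tag{4.24} \]

and

\[ E\{\xi_k \otimes \xi_k \mid X_k = x\} = \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int w \otimes w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O\left(\frac{b_k}{a_k}\right) . \tag{4.25} \]
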


Now recall that $x \in K$ and $\gamma > 0$. Hence for $k$ large enough (and it is enough to consider large $k$), $U(x) \le |x|^2/a_k^{\gamma}$ so that $\tilde{s}_k(x,y)$ which was defined by (4.20) is the same as (4.7), and consequently (4.24) and (4.25) are the same equations as (4.9) and (4.16), respectively (except for the error terms). Lemma 3 now follows by the same procedure as in the proof of Lemma 2 except now $\nabla U(x) = O(1)$ instead of $\nabla U(x) = O(|x|)$.

□


Proof of Lemma 4:

In the sequel we condition on $X_k = x$ where $|x| > 1$; the case where $|x| \le 1$ is similar. Let $\tilde{s}_k(x,y)$ be given by (4.20) and $\hat{s}_k(x,y) = s_k(x,y) - \tilde{s}_k(x,y)$. Using (3.17) we get for some $\delta > 0$

\[ \sup_{\epsilon \in (0,1)} |HU(x + \epsilon(y - x))| \le \sup_{|z - x| < \delta|x|} |HU(z)| \le c_1 \left(\frac{U(x)}{|x|^2} + 1\right) \]

for all $|y - x| < \delta|x|$, and in particular the inequality holds when $U(z) = |z|^2$. Hence by considering the two cases where $U(x)$ is $\le$ or $>$ $|x|^2/a_k^{\gamma}$ and using Lemma 1 we get

\[ |\hat{s}_k(x,y)| \le c_2 \frac{a_k^{1-\gamma}}{b_k^2} \frac{|y - x|^2}{|x|^2} , \qquad |y - x| < \delta|x| . \tag{4.26} \]



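Using (4.26) we get similarly to the derivation of (4.23)

\[ \int |w|^i \, |\hat{s}_k(x, x + b_k|x|w)| \, dN(0,I)(w) = O(a_k^{1-\gamma}) . \tag{4.27} \]

Using (4.27) we get similarly to the derivation of (4.9) and (4.16)

\[ E\{\eta_k \mid X_k = x\} = -\psi_k(x) - \frac{b_k}{a_k}|x| \int w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O\left(\frac{b_k}{a_k^{\gamma}}|x|\right) \tag{4.28} \]

and

\[ E\{\eta_k \otimes \eta_k \mid X_k = x\} = \left(\frac{b_k}{a_k}\right)^2 |x|^2 I - \left(\frac{b_k}{a_k}\right)^2 |x|^2 \int w \otimes w \, \tilde{s}_k(x, x + b_k|x|w) \, dN(0,I)(w) + O\left(\frac{b_k}{a_k^{1+\gamma}}|x|^2\right) . \tag{4.29} \]
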


Now (4.28) and (4.29) are the same equations as (4.9) and (4.16), respectively, with $\nabla U(x)$ replaced by $\psi_k(x)$ and $\xi_k$ replaced by $\eta_k$ (except for the error terms). Lemma 4 now follows by the same procedure as in the proof of Lemma 2 except now $\psi_k(x) = O(|x|/a_k^{\gamma})$ instead of $\nabla U(x) = O(|x|)$.

□




5.  REFERENCES 

[1] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M., Optimization by Simulated Annealing, Science, Vol. 220, pp. 671-680, 1983.

[2]  Cerny,  V.,  A  Thermodynamical  Approach  to  the  Travelling  Salesman  Problem, 
Journal  of  Optimization  Theory  and  Applications,  Vol.  45,  pp.  41-51,  1985. 

[3]  Binder,  K.,  Monte  Carlo  Methods  in  Statistical  Physics,  Springer  Verlag,  Berlin, 
1978. 

[4]  Geman,  S.,  and  Geman,  D.,  Stochastic  Relaxation,  Gibbs  Distributions,  and  the 
Bayesian  Restoration  of  Images,  IEEE  Transactions  on  Pattern  Analysis  and 
Machine  Intelligence,  Vol.  PAMI-6,  pp.  721-741,  1984. 

[5] Gidas, B., Nonstationary Markov Chains and Convergence of the Annealing Algorithm, Journal of Statistical Physics, Vol. 39, pp. 73-131, 1985.

[6]  Hajek,  B.,  Cooling  Schedules  for  Optimal  Annealing,  Mathematics  of  Operations 
Research,  Vol.  13,  pp.  311-329,  1988. 

[7]  Mitra,  D.,  Romeo,  F.,  and  Sangiovanni-Vincentelli,  A.,  Convergence  and  Finite- 
Time  Behavior  of  Simulated  Annealing,  Advances  in  Applied  Probability,  Vol.  18, 
pp.  747-771,  1986. 

[8]  Tsitsiklis,  J.,  Markov  Chains  with  Rare  Transitions  and  Simulated  Annealing, 
Mathematics  of  Operations  Research,  Vol.  14,  pp.  70-90,  1989. 

[9] Tsitsiklis, J., A Survey of Large Time Asymptotics of Simulated Annealing Algorithms, Massachusetts Institute of Technology, Laboratory for Information and Decision Systems, Report No. LIDS-P-1623, 1986.

[10]  Geman,  S.  and  Hwang,  C.  R.,  Diffusions  for  Global  Optimization,  SIAM  Journal 
Control  and  Optimization,  24,  pp.  1031-1043,  1986. 

[11] Grenander, U., Tutorial in Pattern Theory, Div. Applied Mathematics, Brown Univ., Providence, RI, 1984.

[12] Chiang, T. S., Hwang, C. R. and Sheu, S. J., Diffusion for Global Optimization in R^n, SIAM Journal Control and Optimization, 25, pp. 737-752, 1987.

[13]  Gidas,  B.,  Global  Optimization  via  the  Langevin  Equation,  Proc.  IEEE  Conference 
on  Decision  and  Control,  1985. 




[14]  Kushner,  H.  J.,  Asymptotic  Global  Behavior  for  Stochastic  Approximation  and 
Diffusions  with  Slowly  Decreasing  Noise  Effects:  Global  Minimization  via  Monte 
Carlo,  SIAM  Journal  Applied  Mathematics,  47,  pp.  169-185,  1987. 

[15] Gelfand, S. B. and Mitter, S. K., Recursive Stochastic Algorithms for Global Optimization in R^d, submitted to SIAM Journal Control and Optimization, 1990.

[16]  Kushner,  H.  J.  and  Clark,  D.,  Stochastic  Approximation  Methods  for  Constrained 
and  Unconstrained  Systems,  Springer,  Berlin,  1978. 

[17]  Ljung,  L.,  Analysis  of  Recursive  Stochastic  Algorithms,  IEEE  Transactions  on 
Automatic  Control,  Vol.  AC-22,  pp.  551-575,  1977. 

[18] Brook, D. G. and Verdini, W. A., Computational Experience with Generalized Simulated Annealing Over Continuous Variables, American Journal of Mathematical and Management Sciences, Vol. 8, Nos. 3 and 4, pp. 425-449, 1988.

[19] Jeng, F. C. and Woods, J. W., Simulated Annealing in Compound Gaussian Random Fields, IEEE Transactions on Information Theory, Vol. IT-36, pp. 94-107, 1990.

[20] Hwang, C. R., Laplace's Method Revisited: Weak Convergence of Probability Measures, Annals of Probability, 8, pp. 1177-1182, 1980.

[21]  Doob,  J.  L.,  Stochastic  Processes,  Wiley,  New  York,  1953. 




Figure  1.  Three  transition  probability  densities 


